Abstract: In this paper, we design and implement a system for automatically building a lexicon for the
Arabic language. Our Arabic lexicon contains word-specific information: morphological information such as
the root (stem) of the word, its pattern and its affixes; the part-of-speech tag of the word, which classifies
it as a noun, verb or particle; and lexical attributes such as gender, number, person, case, definiteness,
aspect, and mood, which are also extracted and stored with the word in the lexicon. A lexicon is a collection
of representations for words used by a natural language processor as a source of word-specific information; this
representation may contain information about the morphology, phonology, syntactic argument structure and
semantics of the word. A good lexicon is badly needed for many natural language applications such as parsing,
text generation, and noun phrase and verb phrase construction. Many rules based on the grammar of the
Arabic language were used in our system to identify the part-of-speech tag and the related lexical attributes of the
word [13]. We tested our system using vowelized and non-vowelized Arabic text documents taken from the
Holy Qur'an and 242 Arabic abstracts chosen randomly from the proceedings of the Saudi Arabian National
Computer Conference, and we achieved an accuracy of about 96%. We discuss the factors behind the remaining
errors and how this accuracy rate can be enhanced.
1. Introduction

Lexicography is the branch of applied linguistics 
concerned with the design and construction of lexica 
for practical use. Lexica can range from a paper 
dictionary or encyclopedia designed for human use and 
shelf storage to the electronic lexicon used in a variety 
of human language technology systems, such as word 
databases, word processors, software for read back by
speech synthesis in Text-to-Speech systems and
dictation by automatic speech recognition systems. At
a more generic level, a lexicon may be a lexicographic
knowledge base from which lexica of all these
different kinds can be derived automatically [10].
Lexicology [10] is the branch of descriptive linguistics 
concerned with the linguistic theory and the 
methodology for describing lexical information, often 
focusing specifically on issues of meaning. 
Traditionally, lexicology has been mainly concerned 
with: 

• Lexical collocations and idioms, 

• Lexical semantics, 

• The structure of word fields and meaning 
components and relations. 



Linguistic theory in the 1990s has gradually been
integrating these dimensions of lexical information.
Thus lexical information now includes lexical semantics
as well as the syntactic, morphological and
phonological properties of words [10].
Lexicon theory is the study of the universal, in 
particular, the formal properties of lexica, from the 
points of view of theoretical linguistics, general 
knowledge representation languages in artificial 
intelligence, lexicon construction, access algorithms in 
computational linguistics, or the cognitive conditions 
on human lexical abilities in empirical 
psycholinguistics [10]. 

Lexical knowledge is the knowledge about individual 
words in the language. It is essential for all types of 
natural language processing [12]. 
A lexicon is a collection of representations for words
used by a linguistic processor as a source of
word-specific information; this representation may contain
information about the morphology, phonology,
syntactic argument structure and semantics of the word
[13].

An important question is how to store lexical
information. The format should be standardized, since many
programs need a lexicon to accomplish their tasks and
many people build this lexicon manually [13].
Lexicon theorists have increasingly made use of 
extensive lexicological and lexicographic descriptions 
as models for testing their theories, and lexicographers 
are increasingly making use of theoretically interesting 
formalisms such as regular expression calculus in 
order to drive parsing, tagging and learning algorithms 
for extracting lexical information from text corpora. 
Furthermore, the computer has accelerated work in 
practical lexicography, and has also gradually led to a 
convergence within these lexical sciences [7, 10, 23]. 
Lexica are necessary for natural language processing
systems such as information extraction and retrieval
systems or dialog systems. For some applications, at
least, a phrasal lexicon is vitally important [21].
Developers of machine translation systems, which 
from the beginning have involved large vocabularies, 
have long recognized the lexicon as a critical system 
resource [18]. 

A critical step towards avoiding duplication of
effort, and consequently towards a more productive
course of action for the realization of resources, is to
build and make publicly available to the community
large-scale lexical resources, with broad coverage and
basic types of information, generic enough to be
reusable in different application frameworks [18].

One application area where lexica are used is speech 
technology, particularly for dictation (speech 
recognition) and readback (text-to-speech) software. 
The size of the Lexicon needed for such applications 
has leapt from a few hundred words in the early 
nineties to tens of thousands today. Software 
technologies are being developed for generating all 
word form variants from the stem forms, and for 
automatically inducing large lexica from text and 
transcription corpora with statistical and symbolic 
classification algorithms. The development of lexica 
for these purposes is a small but growing industry [10]. 

2. Types of information 

The lexicon may contain a wide range of word-specific 
information, depending on the structure and task of the 
natural language processing system [18]. 



A basic lexicon will typically include information 
about morphology. On the syntactic level, the lexicon 
will include in particular the complement structure of 
each word or word sense. A more complex lexicon 
may also include semantic information, such as a 
classification hierarchy and selectional patterns or case 
frames stated in terms of this hierarchy. For machine
translation the lexicon will also have to record
correspondences between lexical items in the source
language and the target language; for speech
understanding and generation it will have to include
information about the pronunciation of individual
words [11, 22].

3. The Model of the Arabic word 

A word is defined as an alphanumeric string between
any two non-alphanumeric characters. An Arabic
word is a word in which all the characters are bare or
diacritized characters of the Arabic alphabet. It may be
either an original Arabic word or an Arabized word.
The original Arabic words are divided in turn into two 
sub categories: 

• Derived Arabic words: These are the verbs
and nouns that are built according to the
Arabic derivation rules. The vast majority
of Arabic words belong to this category.

• Fixed Arabic words: These are a set of words
molded by Arabs in ancient times that do not
obey the Arabic derivation rules. Most of these
fixed words are neither verbs nor nouns; they are
function words such as pronouns, prepositions,
conjunctions, question words, and the like. They
may be best regarded as the glue that ties the
words of the Arabic sentence together [16].

Arabized words are words borrowed from foreign 
languages (perhaps with some phonetic adjustments to 
suit the Arabic pronunciation) that have become 
common among the native Arabic speakers. To 
preserve the purity of the Arabic language, we prefer 
to avoid a word in this category unless its meaning has 
no counterpart among the original Arabic words. 
Figure 1 [14] summarizes this classification of the 
Arabic words. 



Arabic word
    Original
        Derivative
        Fixed
    Arabized

Figure 1. The Classification of Arabic Words

Any treatment of Arabic must treat all of these
categories with the same degree of care [15].

4. Arabic is a Diacritized Language 

The pronunciation of a word in a non-diacritized script
is almost always fully determined by its constituent
characters, so that the sequence of consonants and
vowels determines the correct phonetics. Spanish and
Finnish are examples of such languages [15].

On the other hand, in diacritized scripts the
pronunciation of a word cannot be fully determined
by its constituent characters alone; special marks are
put above or below the characters to determine the
correct pronunciation. These marks are called
diacritics. In such languages, two different words may
have identical spelling while their pronunciations
and meanings are totally different. Arabic script
involves an elaborate diacritization system. Table 1
shows the Arabic diacritics and the significance of
each one [2, 16].

During the process of assigning diacritics we need to
determine two kinds of diacritic information about
each character:

1. The shadda state of the character (with or without
shadda).

2. The diacritic of the character.

Unfortunately, in most Arabic writing today, people do
not explicitly include diacritics. They expect their
readers to rely on their knowledge of the language
and on the context to supply the missing diacritics while
reading a non-diacritized text. Writers include
diacritics only when a severe ambiguity is feared
or in texts designed for educational purposes [16].
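
Because the same word may reach the system with or without diacritics, it is convenient to be able to reduce a diacritized word to its bare form. The following is a minimal sketch, not part of the system described here; it assumes the text is Unicode and relies on the fact that the Arabic diacritics and the shadda are encoded as combining marks. The function names are our own.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove every combining mark, which includes the Arabic diacritics and the shadda."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def is_diacritized(text: str) -> bool:
    """Report whether the text carries at least one combining mark."""
    return any(unicodedata.combining(ch) for ch in text)


if __name__ == "__main__":
    word = "\u0643\u064e\u062a\u064e\u0628\u064e"  # the letters k-t-b, each carrying a fateha
    print(strip_diacritics(word), is_diacritized(word))
```

Note that this sketch removes all combining marks, so it would also strip accents from Latin text; a stricter version could test only the Arabic mark range.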



Table 1: The Arabic Diacritics and the Shadda
[The Diacritic and Examples columns are illegible and are omitted.]

No.    Name                 Sounds like                  Comments
(1)    Fateha               a
(2)    Damma
(3)    Kasra                e
(4)    Sokoon               A non-vowelized consonant
(5)    Tanween fateha       Fateha + n                   Only the last character may be assigned this diacritic.
(6)    Tanween damma        Damma + n                    Only the last character may be assigned this diacritic.
(7)    Tanween kasra        Kasra + n                    Only the last character may be assigned this diacritic.
(8)    Vowel                Long (a), (e), or (o) vowel
(9)    Alef leyna           Long (a) vowel               Only a terminal letter may be assigned this diacritic.
(10)   Bypassed character   Not pronounced
(11)   Hidden alef vowel    Long (a)
(12)   Shadda                                            In fact, shadda is not a diacritic but a mark of
                                                         doubling the character while pronouncing it. The
                                                         character with a shadda needs another diacritic
                                                         (from no. 1 to no. 7) to determine its vowel.

An Arabic word may appear in any of three
diacritization states:



1. Full diacritization: This means the assignment of 
all the diacritic information for each character in 
the word including the last one. In Arabic, the 
diacritization of the last character sometimes 
depends on the syntactic analysis of the word 
within its sentence [2]. 

2. Half diacritization: This is the same as full
diacritization except that it does not include the
diacritic mark of the last character if that mark depends
on the syntactic analysis of the word [2].

3. Partial diacritization: Any other diacritization 
state of the word that provides less diacritic 
information than half diacritization is called partial 
diacritization [2]. 

5. The Prefix-Body-Suffix Structure of the Arabic 
Word 

While all languages allow us to express the same ideas 
using a variety of sounds, they differ a great deal in the 
ways they provide for stringing concepts together. 
One scale on which they differ is the 
"analytical/agglutinative" scale. An agglutinative 
language allows the speaker to glue multiple 
morphemes together into a single word; an analytical 
language divides them into separate words. English is 
a rather analytical language, French is even more so; 
Arabic is much more agglutinative, though not so 
much as modern Finnish or Turkish [2]. An Arabic word
may correspond to a single entity but can as well be
compounded of more than one entity. In fact it may be
a phrase or even a complete sentence. So the Arabic
word is in general a complex unit. If we study a sufficiently
large sample of Arabic text, we can infer the following
simple general structure for Arabic words:

a. The main part of a noun or a verb occurs in the
middle. Let us call this part the body of the word.

b. The body may be prefixed by a definite article, a
preposition, a gender determiner, a tense determiner,
etc., or some combination of these. When a prefix
precedes a body, it may slightly modify the body's string
and may itself be slightly modified. We should note that
the prefix cannot be a standalone word.

c. The body may also be suffixed by a pronoun, a
gender determiner, a tense determiner, etc., or some
combination of these. When a suffix follows a body, it
may slightly modify the body's string and may itself be
slightly modified. We should also note that the suffix
cannot be a standalone word [2].
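
The prefix-body-suffix structure above can be illustrated with a small sketch. This is not the segmentation used in our system: the two affix lists are short hypothetical samples, the function name and the minimum body length of three letters are our own assumptions, and the sketch ignores the slight modifications that affixes and body may cause in each other.

```python
# Hypothetical sample lists; a real system would need far larger inventories.
PREFIXES = ["وال", "بال", "ال", "و", "ب", "ل"]   # ordered longest first
SUFFIXES = ["هما", "هم", "ها", "ه", "ك"]          # ordered longest first


def segment(word: str, min_body: int = 3) -> tuple[str, str, str]:
    """Split a bare word into (prefix, body, suffix) by longest match.

    An affix is only removed if at least `min_body` letters remain,
    which guards against stripping letters that belong to the body.
    """
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) - len(p) >= min_body),
        "",
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) - len(s) >= min_body),
        "",
    )
    body = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, body, suffix


if __name__ == "__main__":
    for w in ["والكتاب", "كتابهم", "بيت"]:
        print(w, segment(w))
```

The length guard is a crude stand-in for real morphological knowledge: it keeps "بيت" intact even though it begins with a letter that can also be a preposition.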



6. Arabic word categories 

Arabic grammarians traditionally classify words into
three main categories. These categories are further
divided into subcategories, which collectively cover
the whole of the Arabic language. The categories are:

1. Nouns 

A noun in Arabic is a name or a word that describes a 
person, thing, or idea. 

The linguistic attributes of nouns that have been used 
in our tagset [14] are: 

• Gender: Masculine Feminine Neuter 

• Number: Singular Plural Dual 

• Person: First Second Third 

• Case: Nominative Accusative Genitive 

• Definiteness: Definite Indefinite 

2. Verbs 

Verbs indicate an action, although the tenses and 
aspects are different. Verb aspect is divided into three 
classes: Perfect, Imperfect, and Imperative. 
The verbal attributes are [14]: 

• Gender: Masculine Feminine Neuter 

• Number: Singular Plural Dual

• Person: First Second Third 

• Mood: Indicative Subjunctive Jussive 

3. Particles 

The Particle class includes prepositions, adverbs,
conjunctions, interrogative particles, exceptions and
interjections.

The subcategories of particle are [14]:

Prepositions    Adverbs        Conjunctions
Interjections   Exceptions     Negatives
Answers         Explanations   Subordinates

7. Our Lexicon Approach 

The objective of our project is to build a Lexicon for 
the Arabic language by automatic means. This lexicon 
contains morphological information, part-of-speech 
tags, linguistic attributes, patterns and affixes for all 
lexicon entries. 

Our new algorithm for automatically constructing a
lexicon for the Arabic language takes as input a
vowelized or non-vowelized Arabic text document; in
our experiments the documents were taken from the
Holy Qur'an and 242 Arabic abstracts chosen from the
Proceedings of the Saudi Arabian National Computer
Conference. Its output is a lexicon for the Arabic
language. Figure 2 shows the main stages for
constructing an Arabic language lexicon using our system.

To achieve the objective of the project, we have 
designed and implemented several processes that carry 
out separate and well-defined tasks that can be re-used 
in other natural language processing systems. 

[Flowchart: Start → Input: vowelized or non-vowelized text → … → Automatic Arabic language lexicon → End]

Figure 2. Algorithm for the Automatic Construction of an Arabic Language Lexicon



Tokenizing process 

This process locates a document and isolates its words
(tokens); these tokens are stored in a work table that
contains all the information associated with each token
in the document. Each token is compared with the
contents of the STOPWORDS and LEXICON tables; if
it is a stopword or has already been stored in
the lexicon, its linguistic attributes are copied
to the WORD table, where each token is given an
automatic unique number and stored with its word
number and document number, and the process
continues with the next token.
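
The tokenizing step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation itself: the STOPWORDS and LEXICON tables are modelled as plain dictionaries, the row field names are hypothetical, and a token is taken to be a run of ASCII letters and digits or of characters in the range U+0621 to U+0652, which covers the Arabic letters and the common diacritics.

```python
import re
from itertools import count

# ASCII alphanumerics plus Arabic letters and diacritics (U+0621..U+0652).
TOKEN_RE = re.compile(r"[0-9A-Za-z\u0621-\u0652]+")


def tokenize(text: str) -> list[str]:
    """Isolate the words (tokens) of a document."""
    return TOKEN_RE.findall(text)


def build_word_rows(doc_id, text, stopwords, lexicon, ids=None):
    """Build WORD-table rows; known tokens inherit their stored attributes."""
    if ids is None:
        ids = count(1)  # automatic unique token numbers
    rows = []
    for word_no, token in enumerate(tokenize(text), start=1):
        # Look in the stopword table first, then in the lexicon.
        known = stopwords.get(token) or lexicon.get(token)
        rows.append({
            "token_id": next(ids),
            "doc_id": doc_id,
            "word_no": word_no,
            "token": token,
            "attributes": known,  # None means the later stages must analyse it
        })
    return rows


if __name__ == "__main__":
    stop = {"ثم": {"tag": "particle"}}
    for row in build_word_rows(1, "ذهب الولد، ثم عاد.", stop, {}):
        print(row)
```

Tokens whose attributes are still `None` after this step are the ones that go on to the stemming, affix, pattern and tagging processes described below.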

The Stopwords matcher is built on the
STOPWORDS table in the project database, which
contains many Arabic stopwords along with their
lexical information. These stopwords were compiled
by [20].

The Stopwords matcher compares each word
in the document with the words in the STOPWORDS
table. If a match is found, the lexical information
stored in the STOPWORDS table is assigned to the
word in the document as its tag and other attributes,
and processing continues with the next token.

The Stemming Process 

This process is designed to extract the root of every
word in the document. The stemming process
extracts roots constructed of three letters and stores the
root in the Root attribute of the WORD table. The root
is the most important morphological attribute of the
word, since many processes use it to accomplish their
tasks. Many morphological systems have been built to
extract the roots of Arabic words, e.g. Al-Fadaghi and
Al-Anzi [3]; we used an algorithm for extracting the
root of the Arabic word designed by Al-Shalabi [5].

International Journal of Computing & Information Sciences, Vol. 2, No. 2, August 2004

Affix Extraction Process 

This process extracts the affixes from the word.
Affixes are of three types: prefixes are the extra
characters added to the beginning of the word; infixes
are the extra characters added to the middle of the
word; suffixes are the extra characters added at the end
of the word. By extracting the root of the word, we
identify the original characters of the word, so all
other letters that form the word are extraneous
characters. This process determines the affixes of each
of the words and stores them in the lexicon.
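
Given a word and its root, the affixes are simply the letters left over once the root letters have been located. The sketch below illustrates this idea; it is an assumption-laden simplification in which each root letter is matched to its leftmost possible position in the word, and the function name and the returned dictionary layout are our own.

```python
def extract_affixes(word: str, root: str):
    """Return the prefix, infixes and suffix of `word` relative to `root`.

    Root letters are matched in order, each at its leftmost possible
    position. Returns None if the root letters do not occur in order.
    """
    positions = []
    start = 0
    for letter in root:
        idx = word.find(letter, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    return {
        "prefix": word[: positions[0]],
        # Letters lying between two consecutive root letters.
        "infixes": [word[a + 1 : b] for a, b in zip(positions, positions[1:]) if b - a > 1],
        "suffix": word[positions[-1] + 1 :],
    }


if __name__ == "__main__":
    print(extract_affixes("مكتوب", "كتب"))
    print(extract_affixes("كاتبون", "كتب"))
```

Leftmost matching can pick the wrong position when an affix letter is identical to a root letter, so a production system would need to check the candidate split against the known patterns.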

The Pattern Recognizer Process 

This process extracts patterns from the words of the
Arabic documents. The pattern recognizer first identifies
the relative pronouns attached to the end of verbs, and the
definiteness letters, progress verb letters, order verb
letters, prepositions, conjunctions, etc. attached to the
beginning of the word. These affixes are not part of the
word and should be set aside when the pattern is
recognized.

The pattern is constructed by combining the letters
"ف", "ع", "ل" with the affixes of the word according to
their order in the word. The patterns are then stored in
the lexicon with the word.
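
The construction of a pattern from a word and its root can be sketched as follows. The sketch assumes a three-letter root and a word from which attached pronouns, articles and similar clitics have already been removed; the function name is hypothetical and root letters are again matched at their leftmost positions.

```python
PATTERN_LETTERS = "فعل"  # placeholders for the first, second and third root letters


def word_pattern(word: str, root: str):
    """Replace the root letters of `word` by the pattern letters, keeping the affixes in place."""
    if len(root) != len(PATTERN_LETTERS):
        raise ValueError("this sketch handles three-letter roots only")
    out = []
    matched = 0
    for ch in word:
        if matched < len(root) and ch == root[matched]:
            out.append(PATTERN_LETTERS[matched])
            matched += 1
        else:
            out.append(ch)  # affix letters are copied unchanged
    return "".join(out) if matched == len(root) else None


if __name__ == "__main__":
    print(word_pattern("مكتوب", "كتب"))
    print(word_pattern("كاتب", "كتب"))
```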

The Part-of-Speech Tagging Process 

This process assigns part-of-speech tags to all
lexicon entries. We used the fully automatic Arabic text
tagging system implemented by Kanaan, Al-Shalabi
and Sawalha [14]. The part-of-speech tags are then
stored in the lexicon.

Storing the Words in the Lexicon 

This process is responsible for storing new words in 
the LEXICON table in the project database. When all 
of the operations have been finished, all tokens of the 
document have been processed and stored in the 
WORD table. Along with each word are stored all its 
attributes such as part-of-speech tag, root, pattern, 
affixes, relative pronouns, and conjunctions attached to 
the token. This process transfers all words not
already found in the lexicon to it, along with all their
associated lexical attributes.
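
The transfer of new words from the WORD table to the LEXICON table can be illustrated with an in-memory SQLite database. The column names below are hypothetical, since the actual schema of the project database is not given here; the sketch relies on a primary key on the word so that entries already in the lexicon, and repeated tokens, are skipped.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create simplified WORD and LEXICON tables (assumed column layout)."""
    conn.execute(
        "CREATE TABLE WORD (token_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "doc_id INTEGER, word_no INTEGER, token TEXT, root TEXT, pattern TEXT, tag TEXT)"
    )
    conn.execute(
        "CREATE TABLE LEXICON (word TEXT PRIMARY KEY, root TEXT, pattern TEXT, tag TEXT)"
    )


def store_new_words(conn: sqlite3.Connection) -> None:
    """Copy every token not yet in the lexicon, together with its attributes."""
    conn.execute(
        "INSERT OR IGNORE INTO LEXICON (word, root, pattern, tag) "
        "SELECT token, root, pattern, tag FROM WORD ORDER BY token_id"
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    conn.execute("INSERT INTO LEXICON VALUES ('كتب', 'كتب', 'فعل', 'verb')")
    conn.executemany(
        "INSERT INTO WORD (doc_id, word_no, token, root, pattern, tag) VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, "كتب", "كتب", "فعل", "verb"),
         (1, 2, "كاتب", "كتب", "فاعل", "noun"),
         (1, 3, "كاتب", "كتب", "فاعل", "noun")],
    )
    store_new_words(conn)
    print(conn.execute("SELECT word, tag FROM LEXICON ORDER BY word").fetchall())
```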

Once this process has finished, the user can view all 
words stored in the lexicon on the screen shown in 
Figure 3. 



[Screenshot: the screen listing the words stored in the lexicon]

Figure 3. The Lexicon



Tagging Nouns 

We have constructed many morphological rules that
identify words as nouns in the Arabic language.
Rules for extracting nouns from documents are
constructed according to the special grammar of the
Arabic language. This grammar includes the affixes of
the word:

• Prefixes such as "ال", etc.

• Suffixes such as [illegible], etc.

• Diacritic marks attached to the first and last letters
of the word.

The position of the word in the sentence is a good
indicator in identifying nouns. Some words are always
followed by nouns, such as "كان وأخواتها" and
"إن وأخواتها"; and some of these words are mainly used in
recognizing proper nouns, such as the Arabic words for
"Mr" and "kingdom".
Thus we can construct a rule to help us identify nouns
in the text using these phenomena.
The Arabic language has many patterns; some of
them apply to nouns only, some apply to
verbs only, and some apply to both nouns and verbs.
We recognize some of those used as noun patterns,
such as "فاعل", "مفعول", etc.
Tables 2 to 6 show a complete set of rules for tagging
declinable nouns.
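
A few of the affix rules can be sketched in code. The example covers only three rules (Rule 1 from Table 2 and Rules 6 and 7 from Table 3), and the Arabic strings are our reading of those rules: the masculine plural endings "ون" and "ين", the definite article "ال", the prefix "لل", and the imperfect-verb prefix letters "أ", "ن", "ي", "ت". The function name and the returned dictionary are hypothetical.

```python
IMPERFECT_LETTERS = "أنيت"  # assumed set of imperfect-verb prefix letters


def tag_declinable_noun(word: str):
    """Apply three sample noun rules; return the matching rule or None."""
    if word.startswith("لل"):
        # Rule 7: the prefix marks a noun in the genitive case.
        return {"rule": 7, "tag": "noun", "case": "genitive"}
    if word.startswith("ال"):
        # Rule 6: a word carrying the definite article is a noun.
        return {"rule": 6, "tag": "noun"}
    if word.endswith(("ون", "ين")) and word[:1] not in IMPERFECT_LETTERS:
        # Rule 1: masculine plural ending on a word that is not an imperfect verb.
        return {"rule": 1, "tag": "noun", "number": "plural", "gender": "masculine"}
    return None


if __name__ == "__main__":
    for w in ["مهندسون", "الكتاب", "للعلم", "يكتبون"]:
        print(w, tag_declinable_noun(w))
```

The check on the first letter is what keeps an imperfect verb such as "يكتبون" from being tagged as a plural noun.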



Table 2: Rules for Suffixes of Declinable Nouns

Rule 1
  Rule description: Any word ending with "ون" or "ين" and not beginning with any of the imperfect-verb prefix letters ("أ", "ن", "ي", "ت").
  Example: [illegible] (Engineers)
  Noun type: Declinable noun
  Linguistic attribute: Plural, masculine

Rule 2
  Rule description: Any word ending with "ات" if the mark before the letter "ا" is [illegible].
  Example: [illegible] (Teachers)
  Noun type: Declinable noun
  Linguistic attribute: Plural, feminine

Rule 3
  Rule description: Any word ending with "ة" must be a NOUN.
  Example: [illegible] (Writing)
  Noun type: Declinable noun
  Linguistic attribute: Singular, feminine

Rule 4
  Rule description: Any word ending with "اء" or "ى" and not beginning with any of the imperfect-verb prefix letters ("أ", "ن", "ي", "ت") must be a NOUN.
  Example: [illegible] (Dictation)
  Noun type: Declinable noun
  Linguistic attribute: Singular, feminine

Rule 5
  Rule description: Any word whose last letter is "ي" with a shadda and whose previous letter carries a kasra must be a NOUN.
  Example: [illegible] (Has red color)
  Noun type: Related noun
  Linguistic attribute: [illegible]





Table 3: Rules for Prefixes of Declinable Nouns

Rule 6
  Rule description: Any word beginning with "ال" or [illegible] or [illegible] must be a NOUN.
  Example: [illegible] (The book)
  Noun type: Declinable noun
  Linguistic attributes:

Rule 7
  Rule description: Any word beginning with "لل" must be a NOUN.
  Example: [illegible] (For the science)
  Noun type: Declinable noun
  Linguistic attributes: Genitive

Rule 8
  Rule description: Any word beginning with "م" carrying a damma, if the mark of the character previous to the last letter is a kasra, must be a NOUN.
  Example: [illegible]
  Noun type: Agent noun
  Linguistic attributes:

Rule 9
  Rule description: Any word beginning with "م" carrying a damma, if the mark of the character previous to the last letter is a fateha, must be a NOUN.
  Example: [illegible]
  Noun type: Patient noun
  Linguistic attributes:

Table 4

Rule 10
  Rule description: Any word following [illegible] or [illegible] or [illegible] must be a NOUN.
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attributes: Accusative

Rule 11
  Rule description: Any word following any of the prepositions ([illegible], ...) must be a NOUN.
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attributes: Genitive

Rule 12
  Rule description: Any word following "إن" or any of its sisters ([illegible]) must be a NOUN.
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attributes: Accusative

Rule 13
  Rule description: Any word following "إلا" must be a NOUN.
  Example: [illegible]
  Noun type: Declinable noun
  Linguistic attributes: Accusative

Rule 14
  Rule description: Any word following any of these words [illegible] must be a NOUN.
  Example: [illegible]
  Noun type: Proper noun
  Linguistic attributes:

Rule 15
  Rule description: [illegible]
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attributes:





Table 5: Rules for Diacritic Marks of Declinable Nouns

Rule 16
  Rule description: Any word ending with a tanween mark (tanween fateha, tanween damma, or tanween kasra) must be a NOUN.
  Example: [illegible]
  Noun type: Declinable noun
  Linguistic attribute: Indefinite

Rule 17
  Rule description: Any word in which the mark of the first letter is a damma, the mark of the second letter is a fateha, and the third letter is "ي" must be a NOUN.
  Example: [illegible]
  Noun type: Diminutive noun
  Linguistic attribute:





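Table 6: Rules for Patterns of Declinable Nouns

Rule 18
  Rule description: Any word that has the weight [illegible].
  Example: [illegible]
  Noun type: Agent noun
  Linguistic attribute:

Rule 19
  Rule description: Any word that has the weight [illegible].
  Example:
  Noun type:
  Linguistic attribute:

Rule 20
  Rule description: Any word that has the weight [illegible].
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attribute:

Rule 21
  Rule description: Any word that has the weight [illegible].
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attribute:

Rule 22
  Rule description: Any word that has the weight [illegible].
  Example: [illegible], [illegible]
  Noun type:
  Linguistic attribute:

Rule 23
  Rule description: Any word that has the weight [illegible].
  Example: [illegible]
  Noun type: [illegible]
  Linguistic attribute: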





Tagging Verbs 

This process is responsible for identifying verbs in the
document. A verb is defined as a word that indicates a
meaning by itself and is associated with a tense or time.
Verbs take words or letters as indicators, such as the
particles [illegible], pronouns, or the letters
[illegible] [1].

The rules of Arabic morphology are based on patterns,
affixes, and combinations of the two.



Pattern 

Verbs in the Arabic language have roots that consist
of 3 or 4 letters. From a single root many verb forms
can be generated according to fixed rules that add
letters such as "س", "أ", "ل", "ت", "م", "و", "ن",
"ي", "ه", "ا" ("سألتمونيها"). These fixed rules are
called patterns. Table 7 lists 15 different essential
patterns.
Table 7: Essential Verb Patterns

#     Pattern      Pattern analysis     Added letters     # of added letters
1     فعل          ف ع ل
2     فعّل         ف ع ع ل              ع                 1
3     فاعل         ف ا ع ل              ا                 1
4     أفعل         أ ف ع ل              أ                 1
5     تفعّل        ت ف ع ع ل            ت، ع              2
6     تفاعل        ت ف ا ع ل            ت، ا              2
7     انفعل        ا ن ف ع ل            ا، ن              2
8     افتعل        ا ف ت ع ل            ا، ت              2
9     افعلّ        ا ف ع ل ل            ا، ل              2
10    استفعل       ا س ت ف ع ل          ا، س، ت           3
11    افعوعل       ا ف ع و ع ل          ا، و، ع           3
12    [illegible]  [illegible]
13    [illegible]  [illegible]          [illegible]       1
14    [illegible]  [illegible]          [illegible]       2
15    [illegible]  [illegible]          [illegible]       3




Affix 

Some affixes are used with verbs, some with
nouns, and some with both verbs and nouns. We have
extracted 31 groups of affixes that are used with the
essential patterns listed in Table 7; these affixes determine
verb attributes such as aspect (perfect, imperfect,
imperative), gender (masculine, feminine), number
(singular, dual, plural), person (first, second,
third), and mood (indicative, subjunctive, jussive), as
shown in Table 8.

The number property of words that have patterns with
no suffixes, as in rules 1 and 14, cannot be specified
directly. To identify the number, we have to refer to
the next word, which is typically the subject of the
sentence, since the verb and its subject are identical in
number property. For example, in the sentence that
means "the student wrote the lesson", the verb
"كتب" ("wrote") is singular, while
it is dual in the sentence that
means "the two students wrote the lesson", and plural
in the sentence that means "the
students wrote the lesson". By referring to the subject
we can determine the number of the verb.
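
The use of affix rules, together with the fallback to the subject when the number is left open, can be sketched as a table lookup. Only six of the 31 rules are included, and the Arabic prefixes and suffixes shown are our reading of the corresponding rows of Table 8. The function name is hypothetical, and the number of the subject is assumed to be supplied by an earlier analysis step.

```python
# (prefix, suffix) -> (rule number, aspect, gender, number, person)
# A number of None means the rule allows singular, dual and plural alike.
AFFIX_RULES = {
    ("", ""): (1, "perfect", "masculine", None, "third"),
    ("", "ا"): (2, "perfect", "masculine", "dual", "third"),
    ("", "وا"): (3, "perfect", "masculine", "plural", "third"),
    ("ي", ""): (14, "imperfect", "masculine", None, "third"),
    ("ي", "ان"): (15, "imperfect", "masculine", "dual", "third"),
    ("ي", "ون"): (16, "imperfect", "masculine", "plural", "third"),
}


def verb_attributes(prefix: str, suffix: str, subject_number=None):
    """Look up the verb attributes; borrow the number from the subject if needed."""
    rule = AFFIX_RULES.get((prefix, suffix))
    if rule is None:
        return None
    rule_no, aspect, gender, number, person = rule
    if number is None:
        number = subject_number  # rules 1 and 14: number comes from the subject
    return {"rule": rule_no, "aspect": aspect, "gender": gender,
            "number": number, "person": person}


if __name__ == "__main__":
    print(verb_attributes("", "", subject_number="dual"))
    print(verb_attributes("ي", "ون"))
```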



Table 8: Verb Affixes Rules

Codes: category 1 = perfect, 2 = imperfect, 3 = imperative; gender 1 = masculine, 2 = feminine, 3 = neuter;
number 1 = singular, 2 = dual, 3 = plural; person 1 = first, 2 = second, 3 = third.

#    Rule                      Category     Gender      Number     Person   Mood
1    Pattern                   Perfect      Masculine   1+2+3      Third    Indicative
2    Pattern + "ا"             Perfect      Masculine   Dual       Third    Indicative
3    Pattern + "وا"            Perfect      Masculine   Plural     Third    Indicative
4    Pattern + "تْ"            Perfect      Feminine    Singular   Third    Indicative
5    Pattern + "تا"            Perfect      Feminine    Dual       Third    Indicative
6    Pattern + "ن"             Perfect      Feminine    Plural     Third    Indicative
7    Pattern + "تَ"            Perfect      Masculine   Singular   Second   Indicative
8    Pattern + "تما"           Perfect      Neuter      Dual       Second   Indicative
9    Pattern + "تم"            Perfect      Masculine   Plural     Second   Indicative
10   Pattern + "تِ"            Perfect      Feminine    Singular   Second   Indicative
11   Pattern + "تن"            Perfect      Feminine    Plural     Second   Indicative
12   Pattern + "تُ"            Perfect      Neuter      Singular   First    Indicative
13   Pattern + "نا"            Perfect      Neuter      Plural     First    Indicative
14   "ي" + Pattern             Imperfect    Masculine   1+2+3      Third    Indicative
15   "ي" + Pattern + "ان"      Imperfect    Masculine   Dual       Third    Indicative
16   "ي" + Pattern + "ون"      Imperfect    Masculine   Plural     Third    Indicative
17   "ت" + Pattern             Imperfect    Feminine    Singular   Third    Indicative
18   "ت" + Pattern + "ان"      Imperfect    Feminine    Dual       Third    Indicative
19   "ي" + Pattern + "ن"       Imperfect    Feminine    Plural     Third    Indicative
20   "ت" + Pattern             Imperfect    Masculine   Singular   Second   Indicative
21   "ت" + Pattern + "ان"      Imperfect    Neuter      Dual       Second   Indicative
22   "ت" + Pattern + "ون"      Imperfect    Masculine   Plural     Second   Indicative
23   "ت" + Pattern + "ين"      Imperfect    Feminine    Singular   Second   Indicative
24   "ت" + Pattern + "ن"       Imperfect    Feminine    Plural     Second   Indicative
25   "أ" + Pattern             Imperfect    Neuter      Singular   First    Indicative
26   "ن" + Pattern             Imperfect    Neuter      Plural     First    Indicative
27   "ا" + Pattern             Imperative   Masculine   Singular   Second   Indicative
28   "ا" + Pattern + "ا"       Imperative   Neuter      Dual       Second   Indicative
29   "ا" + Pattern + "وا"      Imperative   Masculine   Plural     Second   Indicative
30   "ا" + Pattern + "ي"       Imperative   Feminine    Singular   Second   Indicative
31   "ا" + Pattern + "ن"       Imperative   Feminine    Plural     Second   Indicative



Rules

Rules are extracted from the syntax of Arabic sentence
formation; tags are assigned to verbs according to their
position in the Arabic sentence, where some types of
pronouns, prepositions and letters are affixed to verbs.
Some of these rules are:

J*i + {"i-ij ui" i "ja" }

Lexical Attribute Rules for the Arabic
Language

Once the type and major subtype of the word have
been identified, another process is needed to obtain the
linguistic attributes of the word (Person, Number,
Gender, Aspect, and Mood). Each attribute requires
special treatment.

1- Gender (Masculine, Feminine)

We assume that all Arabic words are masculine
except those ending with "2 — " , "* )" , "<$" or
"dil", which are feminine.

2- Number (Singular, Plural, Dual)

If a word ends with "ون" or "ين" and does not begin
with any of the imperfect-verb letters "ي", "ت", "أ", "ن"
("أحرف المضارعة"), then the number attribute of the word
must be masculine plural "جمع مذكر سالم"; if a word ends
with "ات", it must be feminine plural "جمع مؤنث سالم".
Any noun that ends with "ان" or "ين" must have the dual
number attribute; all other words are assumed to be
singular.

3- Person (First, Second, Third)

This lexical attribute is used for pronouns only,
whether they are attached to the word or separate.
Pronouns indicate first, second and third person as
follows:

1- First person pronouns : ( ij 'cP-' <& <U nil)

2- Second person pronouns: 'I '0 'c£ 'f 3 '^ '£")
(<j£ ti£ tld£ t^ ttil

3- Third person pronouns: (هو، هي، هما، هم، هن)
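
The gender and number rules above can be sketched as two small functions. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the suffix strings are assumptions based on standard Arabic morphology, and `FEMININE_ENDINGS` in particular is an assumed list of common feminine markers rather than the paper's exact list.

```python
# Assumed feminine markers; the paper's exact list may differ.
FEMININE_ENDINGS = ("ة", "اء", "ى", "ات")
# Letters that begin imperfect verbs.
IMPERFECT_LETTERS = ("أ", "ن", "ي", "ت")


def gender(word: str) -> str:
    """Default to masculine unless the word carries a feminine ending."""
    return "Feminine" if word.endswith(FEMININE_ENDINGS) else "Masculine"


def number(word: str, is_noun: bool = True) -> str:
    """Apply the plural rules first, then the dual rule, else singular."""
    if word.endswith(("ون", "ين")) and not word.startswith(IMPERFECT_LETTERS):
        return "Plural (masculine)"
    if word.endswith("ات"):
        return "Plural (feminine)"
    if is_noun and word.endswith(("ان", "ين")):
        return "Dual"
    return "Singular"


for w in ("معلمون", "معلمات", "كتابان", "كتاب"):
    print(w, gender(w), number(w))
```

Because the plural rule is tested first, an unvowelized noun ending in "ين" is read as a masculine plural even when it is actually a dual; purely suffix-based rules cannot separate the two readings.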

4- Case (Nominative, Accusative, Genitive)

The case of any singular, feminine or plural noun is
determined according to the following rules:

• Nominative "مرفوع": if the word ends with a letter
that has the diacritic mark "ـُ" "الضمة"

• Accusative "منصوب": if the word ends with a letter
that has the diacritic mark "ـَ" "الفتحة"

• Genitive "مجرور": if the word ends with a letter that
has the diacritic mark "ـِ" "الكسرة"

The case of any masculine plural noun is
determined according to the following rules:

• Nominative "مرفوع": if the word ends with "ون"

• Accusative "منصوب": if the word ends with "ين"
and it is not preceded by any preposition and the
previous word does not have the genitive case.

• Genitive "مجرور": if the word ends with "ين" and it
is preceded by any preposition or the previous
word has the genitive case.

The case of any dual noun is determined according to
the following rules:

• Nominative "مرفوع": if the word ends with "ان"

• Accusative "منصوب": if the word ends with "ين" and
it is not preceded by any preposition and the
previous word does not have the genitive case.

• Genitive "مجرور": if the word ends with "ين" and it
is preceded by any preposition or the previous
word has the genitive case.
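
The case rules above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the `number`, `after_preposition` and `prev_genitive` parameters are hypothetical, the suffix strings are the standard Arabic ones, and only the three short vowel marks are handled (tanween forms are ignored).

```python
from typing import Optional

DAMMA, FATHA, KASRA = "\u064F", "\u064E", "\u0650"

DIACRITIC_CASE = {DAMMA: "Nominative", FATHA: "Accusative", KASRA: "Genitive"}


def case_from_diacritic(word: str) -> Optional[str]:
    """Case of a vowelized noun, read from its final diacritic mark."""
    return DIACRITIC_CASE.get(word[-1:])


def case_from_suffix(word: str, number: str,
                     after_preposition: bool = False,
                     prev_genitive: bool = False) -> Optional[str]:
    """Case of a masculine plural ('plural') or dual ('dual') noun."""
    nominative_suffix = "ون" if number == "plural" else "ان"
    if word.endswith(nominative_suffix):
        return "Nominative"
    if word.endswith("ين"):
        # The same suffix marks both cases; the context decides.
        if after_preposition or prev_genitive:
            return "Genitive"
        return "Accusative"
    return None


print(case_from_diacritic("كتاب" + DAMMA))
print(case_from_suffix("معلمين", "plural", after_preposition=True))
```

The accusative and genitive forms share one suffix, which is why the rule needs information about the preceding word rather than the word alone.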

5- Definiteness (Definite, Indefinite)

We assume that the definiteness attribute of a noun is
Indefinite (نكرة) except for these types of nouns:

1- Proper nouns "اسم العلم".

2- Any noun made definite by "ال" (the).

3- Any noun following another definite noun. " tiUa-a

4- Any noun following "Li" , "* \i—Jl\ ljj^I" " Sj_Sil)

Some other nouns are always definite (معرفة), such as:
1- Pronouns "الضمائر". 2- »J-&$\ p—t 3- *U_u,VI
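
The definiteness default and its exceptions can be sketched as a single function. This is a minimal sketch: the function name and keyword flags are hypothetical, and the flags stand in for information (proper noun, pronoun, preceding definite noun) that earlier stages of the system would have to supply.

```python
DEFINITE_ARTICLE = "ال"


def definiteness(word: str, *, is_proper_noun: bool = False,
                 is_always_definite: bool = False,
                 follows_definite_noun: bool = False) -> str:
    """Indefinite by default; definite when any listed exception applies."""
    if is_always_definite or is_proper_noun:
        return "Definite"
    if word.startswith(DEFINITE_ARTICLE) or follows_definite_noun:
        return "Definite"
    return "Indefinite"


print(definiteness("الكتاب"))
print(definiteness("كتاب"))
```

The sketch checks the article only at the very start of the word, so a conjunction or preposition attached before the article would have to be stripped first.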
8. Implementation

We built a program that applies all of the rules
described in this paper using Microsoft Visual Basic
6.0 and a Microsoft Access database. The figures below
show the program screens used to input text and to
display the results of applying the rules to build our
lexicon for the Arabic language automatically.
The screen shown in Figure 4 is designed to facilitate
the entry of new text: pressing the button labeled
"£la" (1) opens a dialog box for selecting the document;
the new text is shown in the form's text box, and a unique
document number must be entered for each document.
The button labeled "JjI»j" (2) performs the tokenizing,
stemming, affix extraction and pattern generation
processes, and then displays the Word List screen shown
in Figure 5. The button labeled " fSUlutfl 4jjjj dl_«jk4 &c"
(3) allows the user to inquire about any document that
has already been processed and stored in the project
database; the document number must be entered before
pressing this button, after which the Word Display
screen shown in Figure 5 is displayed.
The button labeled "LEXICON *^J±* o*>*" (4)
allows users to view the contents of the lexicon stored
in the project database, on the screen shown in Figure 7.
Pressing the last button, labeled "»1 i S t" (5), ends
the program.



Figure 4. The Current Document Scanned by the System



The screen shown in Figure 5 lists all the words
(except for the stop words) extracted from the most
recent document after applying the tokenizing, stemming,
affix removal and pattern generation processes.
The table on the screen shows the word, its root, extra
letters attached to the beginning of the word such as
conjunctions, letters indicating imperfect verbs,
prefixes, two groups of infixes, suffixes, pronouns,
and the pattern. When the button labeled "* JjiaJil" (1)
is pressed, the part-of-speech tagging process and the
linguistic attribute extraction process are applied to
the words shown on this screen. The resulting
information is displayed on the screen shown in
Figure 6. Pressing the last button, labeled "f l$J)" (2),
ends the program.




Figure 5. All the Words except the Stop Words 



The screen shown in Figure 6 is designed to display
linguistic information about the words in the processed
document other than stop words. The screen shows the
word, the main part-of-speech category, the
subcategory of the part-of-speech tag and the linguistic
attributes (gender, number, person, case, definiteness,
aspect, and mood). The button labeled "^5U2-uil" (1)
displays the lexicon after processing the document, as
shown in Figure 7. The button labeled "4j-"^j^ 4-ajIM"
(2) shows the title screen. Pressing the button labeled
"flfil" (3) ends the program.




The screen shown in Figure 7 is designed to display
the lexicon constructed automatically so far. The
table on this screen displays the lexicon information
about a given word: its linguistic attributes (gender,
number, person, case, definiteness and mood), its root
and its pattern. The button labeled
"£i*j" (1) displays the search screen, where we can
search for specific words stored in the lexicon
database. The button labeled "£>?-j" (2) shows the
title screen. Pressing the button labeled "fi-fJI" (3)
ends the program.



Figure 6. Displaying a Word and Its Attributes

Figure 7. Displaying the Lexicon



The search screen shown in Figure 8 is invoked by
pressing the button labeled "iii*/" (1) on the fifth
screen, shown in Figure 7. This screen allows the user
to search the lexicon database for an item that matches
the word entered in the text box labeled "<L«l£J)" (1) on
the screen. It then displays the information stored in
the lexical entry for that word. To execute the search,
the user enters the word and presses the
button labeled "iii^" (2). The button labeled "£>*j"
(3) returns to the previous screen, the fifth screen,
shown in Figure 7. The button labeled "fL$JI" (4)
terminates the program.




Figure 8. The Search Screen 



9. Results 

We tested our system using vowelized passages from the
Holy Qur'an and a set of non-vowelized Arabic abstracts
chosen from the Proceedings of the Saudi Arabian
National Computer Conference. We ran our system on a
randomly selected group of these documents and obtained
the results shown in Table 9.
System model | # Words | # incorrect Words | # correct Words | % incorrect Words | % correct Words | Total %
Stemming Process | 388 | 69 | 319 | 17.78 | 82.22 | 82.22
Pattern Analyzer Process | 319 | 6 | 313 | 1.88 | 98.12 | 98.12
Part-of-Speech Tagging | 313 | 11 | 302 | 3.50 | 96.50 | 96.50
Lexical Attribute Analyzer Process | | | | | | 96.03
  Gender | 302 | 27 | 275 | 8.95 | 91.05 |
  Number | 302 | 18 | 284 | 5.96 | 94.04 |
  Person | 302 | 13 | 289 | 4.30 | 95.70 |
  Case | 302 | 9 | 293 | 2.98 | 97.02 |
  Definiteness | 302 | 1 | 301 | 0.33 | 99.67 |
  Mood | 302 | 4 | 298 | 1.33 | 98.67 |
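
The percentages in the table follow directly from the word counts, and the 96.03% total for the lexical attribute analyzer is the mean of the six attribute accuracies. The following is a minimal sketch that recomputes them; the names `accuracy`, `stages` and `attribute_errors` are hypothetical, and the counts are copied from the table. Because the recomputed values are rounded rather than truncated, a few may differ from the printed ones in the last digit (for example, 91.06 against 91.05 for gender).

```python
def accuracy(total: int, incorrect: int) -> float:
    """Percentage of correctly processed words."""
    return 100.0 * (total - incorrect) / total


# (words examined, incorrect words) per stage, as reported in the table.
stages = {
    "Stemming": (388, 69),
    "Pattern Analyzer": (319, 6),
    "Part-of-Speech Tagging": (313, 11),
}

# Incorrect counts for each lexical attribute, all out of 302 words.
attribute_errors = {
    "Gender": 27, "Number": 18, "Person": 13,
    "Case": 9, "Definiteness": 1, "Mood": 4,
}

stage_accuracy = {name: accuracy(*counts) for name, counts in stages.items()}
attribute_accuracy = {
    name: accuracy(302, errors) for name, errors in attribute_errors.items()
}
mean_attribute_accuracy = sum(attribute_accuracy.values()) / len(attribute_accuracy)

for name, value in {**stage_accuracy, **attribute_accuracy}.items():
    print(f"{name}: {value:.2f}%")
print(f"Lexical attribute mean: {mean_attribute_accuracy:.2f}%")
```

Each stage receives only the words that the previous stage processed correctly (388, then 319, then 313, then 302), so every accuracy figure is relative to the input of its own stage.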



When we calculate the system's efficiency, we
discard the errors coming from the stemming process,
since the focus of the research is on constructing an
Arabic lexicon automatically. The other essential parts
of the system are analyzed and the efficiency of each
part is calculated. Faults in the system were caused
by some uncontrolled conditions. The stemming
algorithm used in our program is designed to
extract roots of three letters; however,
some roots have four letters, which our system does
not handle. Another factor that affects the
efficiency of the system is the extraction of incorrect
roots when some of the word's letters are doubled
and the doubled letters are marked with shadda "ـّ",
which is not a diacritic but a mark indicating that the
character is doubled when it is pronounced. Errors in
the number attribute occurred because some plurals
in the Arabic language are formed irregularly and
some singular or dual words have the same shape as
plural words, which makes detecting this attribute
automatically a very hard task.
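
One possible way to reduce the shadda-related root errors is to expand each shadda into an explicit second copy of the doubled letter before stemming. This is only a sketch of that idea, not something the system described here does; the function name `expand_shadda` is hypothetical.

```python
import unicodedata

SHADDA = "\u0651"


def expand_shadda(word: str) -> str:
    """Replace each shadda with a second copy of the letter it doubles."""
    out = []
    last_base = ""
    for ch in word:
        if ch == SHADDA:
            # Repeat the most recent base letter (nothing if there is none).
            out.append(last_base)
        else:
            if unicodedata.combining(ch) == 0:
                last_base = ch  # remember base letters, skip vowel marks
            out.append(ch)
    return "".join(out)


print(expand_shadda("مد" + SHADDA))
```

Tracking the last base letter, rather than simply the previous character, matters because a vowel mark can sit between the letter and its shadda in the stored text.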